Deep learning image segmentation reveals patterns of UV reflectance evolution in passerine birds

Ultraviolet colouration is thought to be an important form of signalling in many bird species, yet broad insights regarding the prevalence of ultraviolet plumage colouration and the factors promoting its evolution are currently lacking. In this paper, we develop an image segmentation pipeline based on deep learning that considerably outperforms classical (i.e. non-deep-learning) segmentation methods, and use this to extract accurate information on whole-body plumage colouration from photographs of >24,000 museum specimens covering >4,500 species of passerine birds. Our results demonstrate that ultraviolet reflectance, particularly as a component of other colours, is widespread across the passerine radiation but is strongly phylogenetically conserved. We also find clear evidence in support of the role of light environment in promoting the evolution of ultraviolet plumage colouration, and a weak trend towards higher ultraviolet plumage reflectance among bird species with ultraviolet-sensitive rather than violet-sensitive visual systems. Overall, our study provides important broad-scale insight into an enigmatic component of avian colouration, and demonstrates that deep learning has considerable promise for bringing new data to bear on long-standing questions in ecology and evolution.


Supplementary Note 1. Methods and Results from Additional Model Testing

(i) Effects of input resolution on performance
DeepLabv2 has been shown to perform better when using the original input resolution rather than resized inputs1; in particular, downscaled images result in lower accuracy for pose estimation and classification2. We compared resolutions 8, 10 and 16 times lower than that of the input images (i.e. 618 x 410, 494 x 328 and 309 x 205 pixels) to test whether performance degrades at lower resolutions.
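
For illustration, the downscaled inputs can be produced with a simple resize step. The sketch below assumes originals of roughly 4944 x 3280 pixels (consistent with the 8x, 10x and 16x factors above); the filename is illustrative.

```python
# Minimal sketch of generating the three downscaled resolutions with OpenCV.
# The filename and exact original resolution are assumptions for illustration.
import cv2

target_sizes = {8: (618, 410), 10: (494, 328), 16: (309, 205)}  # (width, height)

img = cv2.imread("specimen.png")
downscaled = {factor: cv2.resize(img, size, interpolation=cv2.INTER_AREA)
              for factor, size in target_sizes.items()}
```
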

(ii) Effects of input channels on performance
Previous studies have included non-visible light (e.g. UV and IR) information as input to deep learning tasks, sometimes achieving better performance than with RGB channels alone3,4,5. Our dataset includes two sets of images, one filtered to include only human-visible (RGB) wavelengths and one to include only UV wavelengths, because bird plumage frequently includes UV-reflecting regions. All images were taken against a black background made of theatre blackout curtains with very low reflectance of UV light, so the specimens should reflect more UV light than the background. To test whether the inclusion of UV improved network performance, models were trained with (i) RGB channels only, (ii) UV channels only and (iii) RGB plus UV channels.
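
As a rough sketch of how the three channel configurations can be assembled, assuming pixel-aligned, equally sized RGB and UV photographs of the same specimen (filenames are illustrative):

```python
# Stack the RGB and UV images into the three input configurations tested.
# Assumes the two photographs are pixel-aligned and the same size.
import cv2
import numpy as np

rgb = cv2.imread("specimen_rgb.png")                      # H x W x 3 (BGR order)
uv = cv2.imread("specimen_uv.png", cv2.IMREAD_GRAYSCALE)  # H x W

inputs = {
    "rgb_only": rgb,
    "uv_only": uv[..., np.newaxis],       # H x W x 1
    "rgb_plus_uv": np.dstack([rgb, uv]),  # H x W x 4
}
```
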

(iii) Effects of image augmentation on performance
Image augmentation is a common technique that increases the size of the training set by creating new labelled training images through manipulation of existing images and their labels, and has been shown to improve the performance of DeepLabv2 models1. We created an augmented training set from the original training set in which images and their segmentations were randomly rotated (-15° to -1° or 1° to 15°), translated along both the x and y axes (100 to 500 pixels) and scaled (scale ratios from 0.1 to 1.1). We used the augmented dataset to train the model, with evaluation performed on the original validation set.
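
A minimal sketch of this augmentation, applying one shared affine transform to an image and its label mask so the two stay aligned (the helper name is illustrative; parameter ranges follow the text):

```python
# Sketch of the random rotate/translate/scale augmentation described above.
# One affine matrix is applied to both image and mask to keep them aligned.
import random
import cv2
import numpy as np

def augment(img: np.ndarray, mask: np.ndarray):
    h, w = img.shape[:2]
    angle = random.choice([random.uniform(-15, -1), random.uniform(1, 15)])
    scale = random.uniform(0.1, 1.1)
    tx, ty = random.uniform(100, 500), random.uniform(100, 500)

    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
    m[:, 2] += (tx, ty)  # add the translation component

    aug_img = cv2.warpAffine(img, m, (w, h))
    # Nearest-neighbour interpolation keeps mask labels discrete.
    aug_mask = cv2.warpAffine(mask, m, (w, h), flags=cv2.INTER_NEAREST)
    return aug_img, aug_mask
```
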
IOU was marginally, albeit significantly, higher (t[10186] = 5.90, P < 0.05) for the original dataset (mean = 93.1, standard deviation [SD] = 3.24) than for the augmented dataset (mean = 92.7, SD = 3.35) (Supplementary Fig. 3). We also found a similar marginal yet significant difference in precision (t[10186] = 6.63, P < 0.05), with the original dataset (mean = 96.3, SD = 2.38) again outperforming the augmented dataset (mean = 96.0, SD = 2.56) (Supplementary Fig. 3). However, there was no significant difference in recall (t[10186] = 1.81, P = 0.07) (Supplementary Fig. 3). Our original training set consists of highly standardised images, and we suspect that these minor reductions in performance for models trained on augmented data are caused by the introduction of non-standard (i.e. rotated, translated and scaled) images into the training set.

(iv) Effects of subsetting models on performance
In our core pipeline, we used a single deep learning model on images from all views (the all-views model), but variation among views may make the segmentation task harder for a single network to learn. We therefore tested the impact of training and validating separate models for each of the three image views (back, belly and side). This reduces the input data for each model to 1698 images (compared with 5094 images for the all-views model).
We found that subsetting models by image view (i.e. back, belly, side) performed significantly worse than the all-views model, except for recall on the side view (t[3394] = 1.43, P = 0.15) (Supplementary Fig. 4). The back view showed the largest differences in IOU (the all-views model's IOU was 0.7 percentage points higher than that of the view-specific model) and recall (0.5 percentage points higher for the all-views model), while the side view showed the largest difference in precision (0.4 percentage points higher for the all-views model) (Supplementary Fig. 4).

(v) Quality of the training data
In our case, specimen images were taken in a highly consistent manner by controlling the placement of the specimen, the light environment and the background6. However, not all image datasets are likely to be so consistent, owing to practical limits (e.g. inadequate lighting). We tested whether greater variability in data quality could limit performance by generating artificially degraded datasets. To do this, we applied a series of image manipulations in which (i) images were rotated (angles between -45° and 45°), translated (-500 to 500 pixels on the x and y axes) and scaled (scale ratios from 0.8 to 1.2); (ii) 50% of images were randomly horizontally flipped; (iii) images were given new contrast and brightness (α from 0.5 to 2 and β from -50 to 50) using the brightness and contrast adjustment functions in OpenCV7; and (iv) manipulations from (i), (ii) and (iii) were combined. We applied these operations to both training and validation images (in contrast to the image augmentation outlined above, where we did not manipulate the validation set); example predictions for Dataset (iv) are shown in Supplementary Fig. 9. Dataset (iv) was 1.9%, 1.2% and 0.9% worse than the original dataset on IOU, precision and recall, respectively (Supplementary Fig. 5b).
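
For illustration, manipulations (ii) and (iii) can be sketched as follows, using OpenCV's linear pixel transform (new value = α x old value + β); the function name is illustrative:

```python
# Sketch of manipulations (ii) and (iii) used to build the low-quality
# datasets: random horizontal flips and random contrast/brightness changes.
# Ranges follow the text.
import random
import cv2
import numpy as np

def degrade(img: np.ndarray, mask: np.ndarray):
    if random.random() < 0.5:  # (ii) flip 50% of images horizontally
        img, mask = cv2.flip(img, 1), cv2.flip(mask, 1)
    alpha = random.uniform(0.5, 2.0)  # contrast gain
    beta = random.uniform(-50, 50)    # brightness offset
    img = cv2.convertScaleAbs(img, alpha=alpha, beta=beta)  # (iii)
    return img, mask
```
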

(vi) Training dataset size
We manually labelled 5094 images for this study. However, the number of labelled images may be limited by time and resources in other projects and studies, so we investigated the impact of smaller training sets on deep learning accuracy. Previous studies suggest that larger training sets may improve the performance of deep learning models8,9. We used a fixed subset of 1018 images (20% of the dataset) as the validation set for every result in this section. The training set (4076 images) was randomly subsampled five times at each of 15 proportions (1%, every 5% from 5% to 50%, and every 10% from 60% to 90%). We found that model performance was positively related to training set size, following an approximately logarithmic pattern (Supplementary Fig. 6). At least 10% of the dataset was required to attain an IOU higher than 90%, at least 5% to achieve precision and recall higher than 90%, and at least 15% for precision and recall higher than 95%. With 100% of the dataset used for training, the model achieved 93.3% IOU, 96.3% precision and 96.8% recall (Supplementary Fig. 6).
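
A minimal sketch of this subsampling scheme (five random draws per proportion; names are illustrative):

```python
# Sketch of subsampling the training set at 15 proportions, five replicates
# each, with a fixed validation split held constant across all runs.
import random

proportions = [0.01] \
    + [p / 100 for p in range(5, 55, 5)]    # 5%, 10%, ..., 50%
    + [p / 100 for p in range(60, 100, 10)] # 60%, 70%, 80%, 90%

def subsample(train_ids: list, proportion: float, replicates: int = 5):
    n = max(1, round(len(train_ids) * proportion))
    return [random.sample(train_ids, n) for _ in range(replicates)]
```
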

(vii) Effects of plumage colour and background contrast on performance
We explored how the contrast between plumage and non-plumage areas within images affected model performance. We assumed that plumage with high colour contrast against the surrounding non-plumage area would be easier to segment accurately, both for DeepLabv3+ and for the classic methods (thresholding, region growing, Chan-Vese, graph cut), than low-contrast plumage. This is because humans, and many classic segmentation methods, usually segment targets well in images with high contrast between the foreground (e.g. the plumage area) and the background (e.g. the non-plumage area).
To test this, we measured contrast by first calculating the per-pixel absolute Laplacian derivative. A large Laplacian derivative indicates that a pixel is likely to lie near the edge of an object or image feature (i.e. there is high contrast around the pixel), and the Laplacian is widely used for edge detection10,11. We used the mean Laplacian derivative of pixels around the plumage area border to represent the plumage area contrast. These border pixels were selected as the difference between two segmentations produced by the morphological transformations erosion and dilation12. Erosion can be thought of as shrinking a segmentation and dilation as expanding it; the strength of this effect increases with the size of the transformation kernel and the number of iterations. We used a kernel size of five for one iteration to create eroded and dilated segmentations in OpenCV. A large contrast value means that the plumage area is very different from its surrounding non-plumage area. We then calculated Pearson's correlation coefficient (r) between the accuracy metrics (IOU, precision and recall) and the contrast values to evaluate the effect of contrast on performance.
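
The contrast metric can be sketched as follows, assuming a grayscale image and a binary plumage mask; the function name is illustrative and this is not necessarily the exact implementation used:

```python
# Sketch of the plumage-border contrast metric: mean absolute Laplacian over
# the band between the dilated and eroded plumage masks (kernel size 5,
# one iteration, as in the text).
import cv2
import numpy as np

def border_contrast(gray: np.ndarray, mask: np.ndarray,
                    kernel_size: int = 5, iterations: int = 1) -> float:
    # Per-pixel absolute Laplacian derivative (edge strength).
    lap = np.abs(cv2.Laplacian(gray, cv2.CV_64F))

    # Border band = dilated mask minus eroded mask.
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    dilated = cv2.dilate(mask, kernel, iterations=iterations)
    eroded = cv2.erode(mask, kernel, iterations=iterations)
    border = (dilated > 0) & (eroded == 0)

    # Mean edge strength along the plumage boundary.
    return float(lap[border].mean())
```
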
For DeepLabv3+, we found that IOU, precision and recall were generally high across all levels of contrast but declined marginally with increasing contrast between plumage colouration and the background (Supplementary Fig. 7). This slight decline in performance with increasing contrast is opposite to the expectation that high-contrast specimens should be easier to segment than low-contrast specimens, but can potentially be explained by lower sample sizes at higher contrasts, resulting in more limited opportunities for our model to learn to segment high-contrast birds accurately. For the classic methods (thresholding, region growing, Chan-Vese, graph cut), results were more mixed but generally conform to the expectation that segmentation performance using threshold-based methods is typically poorer for low-contrast specimens and generally (but not always) improves with increasing contrast (Supplementary Fig. 7).
On the basis of these results, we suggest that deep learning approaches to (specimen) image segmentation are generally robust to the level of contrast between specimens and the background, provided a consistent background colour is used and the model is sufficiently well trained on images of varying contrast.

Supplementary Fig. 5. Model performance on the four artificially degraded datasets: Dataset (i), images rotated (angles between -45° and 45°), translated (-500 to 500 pixels on the x and y axes) and scaled (scale ratios from 0.8 to 1.2); Dataset (ii), 50% of images randomly horizontally flipped; Dataset (iii), images with random contrast and brightness; Dataset (iv), the combination of (i), (ii) and (iii). (a) In box plots, boxes indicate the median and first and third quartiles, whiskers indicate the range of the data and points indicate outliers. (b) Plots of Tukey's test (95% family-wise confidence level) on whether differences in each metric (IOU, precision and recall) among the tested datasets (n = 5094) are significantly different (blue: significant; grey: not significant) from 0 (red dotted lines). In Tukey test plots, points indicate means and whiskers indicate 95% confidence intervals. Source data are provided as a Source Data file.

Supplementary Fig. 6. Performance (IOU, precision and recall) on the same validation set (n = 1018) using 15 proportions (1%, every 5% from 5% to 50%, and every 10% from 60% to 90%) of the original training set. Source data are provided as a Source Data file.

Supplementary Fig. 7. Performance metrics (from top to bottom: IOU, precision and recall) of predictions (n = 5094) from different methods (from left to right: DeepLabv3+, thresholding, region growing, Chan-Vese, graph cut) in relation to the degree of contrast between plumage and non-plumage areas. Pearson correlation was used to test the association between performance and contrast; in all cases, tests were two-sided with no adjustment for multiple tests. Source data are provided as a Source Data file.

Supplementary Fig. 9. Examples of predictions using Dataset (iv) of the low-quality datasets. Images from Dataset (iv) were randomly rotated, translated, scaled and horizontally flipped, with adjusted contrast and brightness.

Supplementary Table 1. Overview of the four classic segmentation methods (thresholding, region growing, Chan-Vese and graph cut) we tested on the avian specimen dataset.

Thresholding13,14: Thresholding segments an image by allocating each pixel to either the foreground or the background according to whether its value (e.g. grayscale value) exceeds a pre-defined threshold. The threshold can be set manually or calculated automatically from image features such as the image histogram or entropy.
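
For example, a minimal thresholding sketch with an automatically (Otsu) selected threshold in OpenCV (illustrative, not the exact configuration tested):

```python
# Otsu thresholding: the threshold is chosen automatically from the histogram.
import cv2

gray = cv2.imread("specimen.png", cv2.IMREAD_GRAYSCALE)
_, segmentation = cv2.threshold(gray, 0, 255,
                                cv2.THRESH_BINARY + cv2.THRESH_OTSU)
```
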
Region growing15: Region growing starts from initial seeds (i.e. starting locations) and segments a neighbouring pixel into the foreground if the difference between its value and that of the seed is within a pre-defined range. The algorithm iterates this step until no further pixels can be segmented.
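
A region growing sketch using OpenCV's flood fill, with a fixed tolerance range around the seed value (the seed location and tolerances are illustrative):

```python
# Region growing via flood fill: grow from a seed while pixel values stay
# within loDiff/upDiff of the seed value (FLOODFILL_FIXED_RANGE).
import cv2
import numpy as np

gray = cv2.imread("specimen.png", cv2.IMREAD_GRAYSCALE)
h, w = gray.shape
mask = np.zeros((h + 2, w + 2), np.uint8)  # floodFill needs a 2-pixel border
seed = (w // 2, h // 2)                    # e.g. a point on the specimen

flags = 4 | cv2.FLOODFILL_MASK_ONLY | cv2.FLOODFILL_FIXED_RANGE | (255 << 8)
cv2.floodFill(gray, mask, seed, 0, loDiff=10, upDiff=10, flags=flags)
segmentation = mask[1:-1, 1:-1]            # 255 where the region grew
```
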
Chan-Vese16: Chan-Vese requires a closed contour to be placed (often manually) on the image. The algorithm then deforms the contour to minimise the sum of the internal (i.e. regions inside the contour) and external (i.e. regions outside the contour) energies, which depend on the pixel values. Once the energy is minimised, the region inside the final contour is the segmentation.
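
A Chan-Vese sketch using scikit-image (parameters are illustrative; the `max_num_iter` keyword assumes scikit-image >= 0.19):

```python
# Chan-Vese segmentation: evolve a level-set contour to minimise the
# internal/external energy of the image regions it separates.
from skimage import img_as_float, io
from skimage.segmentation import chan_vese

gray = img_as_float(io.imread("specimen.png", as_gray=True))
segmentation = chan_vese(gray, mu=0.25, max_num_iter=200)
```
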
Graph cut17: Graph cut requires initial foreground and background seeds (pre-defined pixels known to belong to the foreground or background). The algorithm treats the image as a graph, with pixels as nodes. Every node (pixel) has three types of edges: (i) edges to its neighbouring nodes; (ii) an edge to a source node (foreground); and (iii) an edge to a sink node (background). Edge weights are based on pixel values and status (i.e. foreground, background or to be segmented). The algorithm segments the image by dividing the graph into two subgraphs (foreground and background) that retain the largest total edge weight (equivalently, the cut removes the minimum total weight), and the segmentation is the foreground subgraph.
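
A graph-cut sketch using OpenCV's GrabCut, which seeds the foreground and background from a bounding rectangle and min-cuts the resulting pixel graph (the rectangle is illustrative):

```python
# GrabCut: iterative graph-cut segmentation seeded by a rough bounding box.
import cv2
import numpy as np

img = cv2.imread("specimen.png")
mask = np.zeros(img.shape[:2], np.uint8)
bgd = np.zeros((1, 65), np.float64)  # background colour model
fgd = np.zeros((1, 65), np.float64)  # foreground colour model
rect = (50, 50, img.shape[1] - 100, img.shape[0] - 100)  # rough specimen box

cv2.grabCut(img, mask, rect, bgd, fgd, 5, cv2.GC_INIT_WITH_RECT)
segmentation = np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD)).astype(np.uint8)
```
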

Table X. Bayesian phylogenetic mixed model results for the effect of predictor variables on plumage UV reflectance in passerine species (n = 4,527). All variables were standardised (mean = 0, SD = 1) prior to model fitting. M, male; UVS, ultraviolet-sensitive. *, P < 0.05; **, P < 0.01; ***, P < 0.001. All models were run over 100 posterior phylogenetic trees. Bold denotes statistically significant terms (P_MCMC < 0.05).